In this tutorial we will use the Keras deep learning library to construct a simple Recurrent Neural Network (RNN) that can learn linguistic structure from a piece of text, and use that knowledge to generate new text passages. To review general RNN architecture, specific types of RNN networks such as the LSTM networks we'll be using here, and other concepts behind this type of machine learning, you should consult the following resources:
This code is an adaptation of these two examples:
You can consult the original sites for more information and documentation.
Let's start by importing some of the libraries we'll be using in this lab:
In [ ]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from time import gmtime, strftime
import os
import re
import pickle
import random
import sys
The first thing we need to do is generate our training data set. In this case we will use a recent article written by Barack Obama for The Economist newspaper. Make sure you have the obama.txt file in the /data folder within the /week-6 folder in your repository.
In [ ]:
# load ascii text from file
filename = "data/obama.txt"
with open(filename) as f:
    raw_text = f.read()
# get rid of any characters other than letters, numbers,
# and a few special characters
raw_text = re.sub('[^\nA-Za-z0-9 ,.:;?!-]+', '', raw_text)
# convert all text to lowercase
raw_text = raw_text.lower()
n_chars = len(raw_text)
print("length of text:", n_chars)
print("text preview:", raw_text[:500])
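To see what the cleaning step above actually removes, here is a minimal sketch that applies the same regular expression to a short made-up string. The sample text and the names toy_raw and cleaned are assumptions for illustration only.

```python
import re

# Made-up sample containing characters the pattern should drop
toy_raw = "Hello, World! (Costs: $5 & 10%) -- it's fine?\nNext line."

# Same pattern as above: keep newlines, letters, digits, spaces
# and a few punctuation marks, then convert to lowercase
cleaned = re.sub('[^\nA-Za-z0-9 ,.:;?!-]+', '', toy_raw).lower()
print(cleaned)
```

Note that apostrophes and brackets are dropped rather than replaced, so a word like "it's" becomes "its" in the training text.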
Next, we use Python's set() function to collect all the unique characters in the text, and sort them into a list. This will form our 'vocabulary' of characters, which is similar to the categories found in typical ML classification problems.
Since neural networks work with numerical data, we also need to create a mapping between each character and a unique integer value. To do this we create two dictionaries: one which has characters as keys and the associated integers as the value, and one which has integers as keys and the associated characters as the value. These dictionaries will allow us to do translation both ways.
In [ ]:
# extract all unique characters in the text
chars = sorted(list(set(raw_text)))
n_vocab = len(chars)
print("number of unique characters found:", n_vocab)
# create mapping of characters to integers and back
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
# test our mapping
print('a', "- maps to ->", char_to_int["a"])
print(25, "- maps to ->", int_to_char[25])
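As a small sanity check on the two dictionaries, the sketch below round-trips a short string through both mappings. The toy text and the helper names encode and decode are illustrative assumptions, not part of the original notebook.

```python
# Toy text standing in for raw_text (an assumption for illustration)
toy_text = "hello world"

# Build the two lookup tables the same way as above
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

def encode(text, mapping):
    """Hypothetical helper: turn a string into a list of integers."""
    return [mapping[c] for c in text]

def decode(indices, mapping):
    """Hypothetical helper: turn a list of integers back into a string."""
    return "".join(mapping[i] for i in indices)

encoded = encode("hello", toy_char_to_int)
print(encoded)
print(decode(encoded, toy_int_to_char))
```

If the decoded string matches the original, the two dictionaries are consistent with each other.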
Now we need to define the training data for our network. With RNNs, the training data usually takes the shape of a three-dimensional array, with the size of each dimension representing:
[# of training sequences, # of training samples per sequence, # of features per sample]
Since each character is one-hot encoded, the number of features per sample is the size of the vocabulary: each character is represented by a vector in which the position assigned to that character is marked by 1, and all others by 0.
To prepare the data, we first set the length of training sequences we want to use. In this case we will set the sequence length to 100, meaning the RNN layer will be able to predict future characters based on the 100 characters that came before.
We will then slide this 100-character 'window' over the entire text to create the inputs and outputs lists. Each entry in the inputs list contains 100 characters from the text, and each entry in the outputs list contains the single character that came after.
In [ ]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
inputs = []
outputs = []
for i in range(0, n_chars - seq_length, 1):
    inputs.append(raw_text[i:i + seq_length])
    outputs.append(raw_text[i + seq_length])
n_sequences = len(inputs)
print("Total sequences: ", n_sequences)
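The sliding window is easier to picture on a very short string. The following sketch repeats the same loop with a window of 3 characters instead of 100. The toy text and the toy_ variable names are assumptions chosen for illustration.

```python
# Toy values chosen for illustration only
toy_text = "abcdefgh"
toy_seq_length = 3

toy_inputs = []
toy_outputs = []
for i in range(len(toy_text) - toy_seq_length):
    # window of toy_seq_length characters
    toy_inputs.append(toy_text[i:i + toy_seq_length])
    # the single character that follows the window
    toy_outputs.append(toy_text[i + toy_seq_length])

for window, target in zip(toy_inputs, toy_outputs):
    print(window, "-->", target)
```

A text of 8 characters with a window of 3 yields 8 - 3 = 5 input/output pairs, which is the same n_chars - seq_length count used above.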
Now let's shuffle both the input and output data so that we can later have Keras split it automatically into a training and test set. To make sure the two lists are shuffled the same way (maintaining correspondence between inputs and outputs), we create a separate shuffled list of indices, and use these indices to reorder both lists.
In [ ]:
indices = list(range(len(inputs)))
random.shuffle(indices)
inputs = [inputs[x] for x in indices]
outputs = [outputs[x] for x in indices]
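The same paired shuffle can be sketched with a NumPy permutation, which also makes it easy to fix a seed for reproducible runs. This is an alternative under assumed toy data, not the method used in the rest of this tutorial.

```python
import numpy as np

# Toy paired lists (assumed data, for illustration only)
toy_inputs = ["abc", "bcd", "cde", "def"]
toy_outputs = ["d", "e", "f", "g"]

# A seeded generator makes the shuffle reproducible
rng = np.random.default_rng(0)
order = rng.permutation(len(toy_inputs))

# Reorder both lists with the same permutation
shuffled_inputs = [toy_inputs[i] for i in order]
shuffled_outputs = [toy_outputs[i] for i in order]

# Every input is still paired with the character that followed it
for window, target in zip(shuffled_inputs, shuffled_outputs):
    print(window, "-->", target)
```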
Let's visualize one of these sequences to make sure we are getting what we expect:
In [ ]:
print(inputs[0], "-->", outputs[0])
Next we will prepare the actual NumPy data sets which will be used to train our network. We first initialize two NumPy arrays with the proper dimensions: X, with shape [n_sequences, seq_length, n_vocab], and y, with shape [n_sequences, n_vocab].
We then iterate over the lists we generated in the previous step and fill the NumPy arrays with the proper data. Since all character data is formatted using one-hot encoding, we initialize both data sets with zeros. As we iterate over the data, we use the char_to_int dictionary to map each character to its related position integer, and use that position to change the related value in the data set to 1.
In [ ]:
# create two zero-filled numpy arrays with the proper dimensions
# (we use the built-in bool, since the np.bool alias is not
# available in all NumPy versions)
X = np.zeros((n_sequences, seq_length, n_vocab), dtype=bool)
y = np.zeros((n_sequences, n_vocab), dtype=bool)
# iterate over the data and build up the X and y data sets
# by setting the appropriate indices to 1 in each one-hot vector
for i, example in enumerate(inputs):
    for t, char in enumerate(example):
        X[i, t, char_to_int[char]] = 1
    y[i, char_to_int[outputs[i]]] = 1
print('X dims -->', X.shape)
print('y dims -->', y.shape)
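One way to check the one-hot encoding is to decode it back into text. The sketch below builds the arrays for a tiny assumed vocabulary and uses argmax to recover each character. All toy_ names and values are illustrative assumptions.

```python
import numpy as np

toy_chars = ["a", "b", "c", "d"]   # assumed 4-character vocabulary
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_inputs = ["abc", "bcd"]
toy_outputs = ["d", "a"]

# Same construction as above, on toy data
toy_X = np.zeros((len(toy_inputs), 3, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_inputs), len(toy_chars)), dtype=bool)
for i, example in enumerate(toy_inputs):
    for t, char in enumerate(example):
        toy_X[i, t, toy_char_to_int[char]] = 1
    toy_y[i, toy_char_to_int[toy_outputs[i]]] = 1

# argmax along the last axis recovers the integer index of each character
decoded = ["".join(toy_chars[k] for k in row) for row in toy_X.argmax(axis=-1)]
print(decoded)

# every position should contain exactly one 1
print(toy_X.sum(axis=-1))
```

If the decoded strings match the original inputs and every position sums to 1, the encoding is well formed.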
Next, we define our RNN model in Keras. This is very similar to how we defined the CNN model, except now we use LSTM() to create an LSTM layer with an internal memory of 128 units. LSTM is a special type of RNN layer designed to mitigate the unstable gradients issue seen in basic RNNs. Along with LSTM layers, Keras also supports basic RNN layers and GRU layers, which are similar to LSTM. You can find full documentation for recurrent layers in Keras' documentation.
As before, we need to explicitly define the input shape for the first layer. Also, we need to tell Keras whether the LSTM layer should pass on its full sequence of outputs (one per time step) or only its output at the final time step. If you are connecting the LSTM layer to a fully connected layer as we do in this case, you should set the return_sequences parameter to False to have the layer pass only the final value of its hidden units. If you are connecting multiple LSTM layers, you should set the parameter to True in all but the last layer, so that subsequent layers can learn from the sequence of outputs of previous layers.
We will use dropout with a probability of 50% to regularize the network and prevent overfitting on our training data. The output of the network will be a fully connected layer with one neuron for each character in the vocabulary. The softmax function will convert this output to a probability distribution across all characters.
In [ ]:
# define the LSTM model
model = Sequential()
model.add(LSTM(128, return_sequences=False, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.50))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
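To make the role of the softmax output concrete, here is a minimal NumPy sketch of the function applied to a few made-up scores. It is a standalone illustration, not the code Keras runs internally, and the names softmax, raw_scores and probs are assumptions.

```python
import numpy as np

def softmax(scores):
    """Convert a vector of raw scores into probabilities."""
    # subtracting the max does not change the result but avoids overflow
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# made-up scores for a 3-character vocabulary
raw_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(raw_scores)
print(probs, probs.sum())
```

The outputs are all positive and sum to 1, so they can be read as the probability of each character being the next one.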
Next, we define two helper functions: one to select a character based on a probability distribution, and one to generate a sequence of predicted characters based on an input (or 'seed') list of characters.
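The sample() function will take in a probability distribution generated by the softmax function, and select a character based on the 'temperature' input. The temperature (also often called the 'diversity') affects how strictly the probability distribution is sampled: values below 1 sharpen the distribution so that the most likely characters are chosen more often, while values above 1 flatten it and produce more varied (and more error-prone) output.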
In [ ]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
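The effect of the temperature is easiest to see by looking at the reweighted distribution before any sampling takes place. The sketch below repeats the first steps of sample() on an assumed 3-character distribution. The helper name reweight and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def reweight(preds, temperature):
    """Return the temperature-adjusted distribution that sample() draws from."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# assumed softmax output for a 3-character vocabulary
toy_preds = [0.6, 0.3, 0.1]
for temperature in [0.2, 1.0, 2.0]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, a lower value concentrates probability on the most likely character, and a higher value spreads it more evenly across the vocabulary.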
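The generate() function will take in a seed sequence of characters (sentence), the number of characters to generate (prediction_length), and the diversity to use when sampling (diversity), and print the resulting sequence of characters to the screen.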
In [ ]:
def generate(sentence, prediction_length=50, diversity=0.35):
    print('----- diversity:', diversity)
    generated = sentence
    sys.stdout.write(generated)
    # iterate over number of characters requested
    for i in range(prediction_length):
        # build up sequence data from current sentence
        x = np.zeros((1, X.shape[1], X.shape[2]))
        for t, char in enumerate(sentence):
            x[0, t, char_to_int[char]] = 1.
        # use trained model to return probability distribution
        # for next character based on input sequence
        preds = model.predict(x, verbose=0)[0]
        # use sample() function to sample next character
        # based on probability distribution and desired diversity
        next_index = sample(preds, diversity)
        # convert integer to character
        next_char = int_to_char[next_index]
        # add new character to generated text
        generated += next_char
        # delete the first character from beginning of sentence,
        # and add new character to the end. This will form the
        # input sequence for the next predicted character.
        sentence = sentence[1:] + next_char
        # print results to screen
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
Next, we set up a callback for Keras to save our model's parameters to a local file after each epoch where it achieves an improvement in the training loss. This will allow us to reuse the trained model at a later time without having to retrain it from scratch. This is useful for recovering models in case your computer crashes, or you want to stop the training early.
In [ ]:
filepath="-basic_LSTM.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
Now we are finally ready to train the model. We want to train the model over 50 epochs, but we also want to output some generated text after each epoch to see how our model is doing.
To do this we create our own loop to iterate over each epoch. Within the loop we first train the model for one epoch. Since all parameters are stored within the model, training one epoch at a time has the same effect as training over a longer series of epochs. We also use the validation_split parameter of the model's fit() function to tell Keras to automatically split the data into 80% training data and 20% test data for validation. Keras takes the validation samples from the end of the data, before any shuffling of its own, so remember to always shuffle your data if you will be using validation!
After each epoch is trained, we use the raw_text data to extract a new sequence of 100 characters as the 'seed' for our generated text. Finally, we use our generate() helper function to generate text using two different diversity settings.
Warning: because of their large depth (remember that an RNN trained on a sequence of length 100 effectively has 100 layers!), these networks typically take a much longer time to train than traditional multi-layer ANNs and CNNs. You should expect these models to train overnight on the virtual machine, but you should be able to see enough progress after the first few epochs to know if it is worth it to train a model to the end. For more complex RNN models with larger data sets in your own work, you should consider a native installation, along with a dedicated GPU if possible.
In [ ]:
epochs = 50
prediction_length = 100
for iteration in range(epochs):
    print('epoch:', iteration + 1, '/', epochs)
    model.fit(X, y, validation_split=0.2, batch_size=256, epochs=1, callbacks=callbacks_list)
    # get random starting point for seed
    start_index = random.randint(0, len(raw_text) - seq_length - 1)
    # extract seed sequence from raw text
    seed = raw_text[start_index: start_index + seq_length]
    print('----- generating with seed:', seed)
    for diversity in [0.5, 1.2]:
        generate(seed, prediction_length, diversity)
That looks pretty good! You can see that the RNN has learned a lot of the linguistic structure of the original writing, including typical length for words, where to put spaces, and basic punctuation with commas and periods. Many words are still misspelled but seem almost reasonable, and it is pretty amazing that it is able to learn this much in only 50 epochs of training.
You can see that the loss is still going down after 50 epochs, so the model can definitely benefit from longer training. If you're curious you can try to train for more epochs, but as the error decreases be careful to monitor the output to make sure that the model is not overfitting. As with other neural network models, you can monitor the difference between training and validation loss to see if overfitting might be occurring. In this case, since we're using the model to generate new information, we can also get a sense of overfitting from the material it generates.
A good indication of overfitting is if the model outputs exactly what is in the original text given a seed from the text, but gibberish if given a seed that is not in the original text. Remember we don't want the model to learn how to reproduce exactly the original text, but to learn its style to be able to generate new text. As with other models, regularization methods such as dropout and limiting model complexity can be used to avoid the problem of overfitting.
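One rough way to quantify the "outputs exactly what is in the original text" symptom is to measure how much of a generated passage appears verbatim in the source. The sketch below is a hypothetical helper, not part of the original notebook: the function name verbatim_fraction, the window size n and the toy strings are all assumptions.

```python
def verbatim_fraction(generated, source, n=20):
    """Share of n-character windows in `generated` that occur verbatim in `source`."""
    if len(generated) < n:
        return 0.0
    windows = [generated[i:i + n] for i in range(len(generated) - n + 1)]
    hits = sum(1 for w in windows if w in source)
    return hits / len(windows)

toy_source = "the quick brown fox jumps over the lazy dog"
copied = "quick brown fox jumps"              # lifted straight from the source
novel = "the lazy fox naps over brown logs"   # similar style, new wording

print(verbatim_fraction(copied, toy_source, n=10))
print(verbatim_fraction(novel, toy_source, n=10))
```

In the notebook you could pass the generated text and raw_text to a helper like this. A fraction close to 1 for long windows would suggest the model is memorizing the text rather than learning its style.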
Finally, let's save our training data and character-to-integer mapping dictionaries to an external file so we can reuse them with the model at a later time.
In [ ]:
pickle_file = '-basic_data.pickle'
try:
    # the with block closes the file even if pickling fails
    with open(pickle_file, 'wb') as f:
        save = {
            'X': X,
            'y': y,
            'int_to_char': int_to_char,
            'char_to_int': char_to_int,
        }
        pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise
statinfo = os.stat(pickle_file)
print('Saved data to', pickle_file)
print('Pickle size in bytes:', statinfo.st_size)
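Reading the data back follows the same pattern in reverse. The sketch below saves a small stand-in dictionary to a temporary file and reloads it, so it can be run without touching the real pickle. The toy values, the temporary path and the names toy_save and loaded are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the real X, y and mapping dictionaries (assumed values)
toy_save = {
    'X': np.zeros((2, 3, 4), dtype=bool),
    'y': np.zeros((2, 4), dtype=bool),
    'int_to_char': {0: 'a', 1: 'b'},
    'char_to_int': {'a': 0, 'b': 1},
}

# Write to a temporary file so the example does not touch the real pickle
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_data.pickle')
with open(toy_path, 'wb') as f:
    pickle.dump(toy_save, f, pickle.HIGHEST_PROTOCOL)

# Reload the file and unpack the saved objects
with open(toy_path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded['X'].shape, loaded['y'].shape)
print(loaded['char_to_int'])
```

To reload the real data, the same open() and pickle.load() calls would be pointed at pickle_file instead of the temporary path.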